INTERSPEECH.2022 - Analysis and Assessment

Total: 79

#1 SAQAM: Spatial Audio Quality Assessment Metric [PDF]

Authors: Pranay Manocha ; Anurag Kumar ; Buye Xu ; Anjali Menon ; Israel Degene Gebru ; Vamsi Krishna Ithapu ; Paul Calamia

Audio quality assessment is critical for evaluating the perceptual realism of sounds. However, the time and expense of obtaining "gold standard" human judgments limit the availability of such data. For AR&VR, good perceived sound quality and localizability of sources are among the key elements to ensure complete immersion of the user. Our work introduces SAQAM which uses a multi-task learning framework to assess listening quality (LQ) and spatialization quality (SQ) between any given pair of binaural signals without using any subjective data. We model LQ by training on a simulated dataset of triplet human judgments, and SQ by utilizing activation-level distances from networks trained for direction of arrival (DOA) estimation. We show that SAQAM correlates well with human responses across four diverse datasets. Since it is a deep network, the metric is differentiable, making it suitable as a loss function for other tasks. For example, simply replacing an existing loss with our metric yields improvement in a speech-enhancement network.

#2 Speech Quality Assessment through MOS using Non-Matching References [PDF]

Authors: Pranay Manocha ; Anurag Kumar

Human judgments obtained through Mean Opinion Scores (MOS) are the most reliable way to assess the quality of speech signals. However, several recent attempts to automatically estimate MOS using deep learning approaches lack robustness and generalization capabilities, limiting their use in real-world applications. In this work, we present a novel framework, NORESQA-MOS, for estimating the MOS of a speech signal. Unlike prior works, our approach uses non-matching references as a form of conditioning to ground the MOS estimation by neural networks. We show that NORESQA-MOS provides better generalization and more robust MOS estimation than previous state-of-the-art methods such as DNSMOS and NISQA, even though we use a smaller training set. Moreover, we also show that our generic framework can be combined with other learning methods such as self-supervised learning and can further supplement the benefits from these methods.

#3 An objective test tool for pitch extractors' response attributes [PDF]

Authors: Hideki Kawahara ; Kohei Yatabe ; Ken-Ichi Sakakibara ; Tatsuya Kitamura ; Hideki Banno ; Masanori Morise

We propose an objective measurement method for pitch extractors' responses to frequency-modulated signals. It enables us to evaluate different pitch extractors with unified criteria. The method uses extended time-stretched pulses combined with binary orthogonal sequences. It provides simultaneous measurements of the linear time-invariant response, the non-linear time-invariant response, and the random and time-varying responses. We tested representative pitch extractors using fundamental frequencies spanning 80 Hz to 800 Hz in 1/48-octave steps and produced more than 2000 modulation frequency response plots. We found that animating these plots as a scientific visualization lets us grasp the behavior of different pitch extractors at a glance; such efficient and effortless inspection is impossible when examining every plot individually. The proposed measurement method with visualization led to further performance improvement of one of the extractors mentioned above. In other words, our procedure turns that pitch extractor into reliable measuring equipment, which is crucial for scientific research. We have open-sourced MATLAB code for the proposed objective measurement method and visualization procedure.

#4 Data Augmentation Using McAdams-Coefficient-Based Speaker Anonymization for Fake Audio Detection [PDF]

Authors: Kai Li ; Sheng Li ; Xugang Lu ; Masato Akagi ; Meng Liu ; Lin Zhang ; Chang Zeng ; Longbiao Wang ; Jianwu Dang ; Masashi Unoki

Fake audio detection (FAD) is a technique to distinguish synthetic speech from natural speech. In most FAD systems, removing irrelevant features from acoustic speech while keeping only robust discriminative features is essential. Intuitively, speaker information entangled in acoustic speech should be suppressed for the FAD task. In particular, a deep neural network (DNN)-based FAD system may learn speaker information from the training dataset and therefore fail to generalize well to a testing dataset. In this paper, we propose to use the speaker anonymization (SA) technique to suppress speaker information in acoustic speech before inputting it into a DNN-based FAD system. We adopted the McAdams-coefficient-based SA (MC-SA) algorithm, with the expectation that the entangled speaker information will not be involved in DNN-based FAD learning. Based on this idea, we implemented a light convolutional neural network bidirectional long short-term memory (LCNN-BLSTM)-based FAD system and conducted experiments on the Audio Deep Synthesis Detection Challenge (ADD2022) datasets. The results showed that removing the speaker information from acoustic speech improved the relative performance in the first track of ADD2022 by 17.66%.
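
The abstract does not detail the anonymization itself, so the following is a minimal per-frame sketch of the McAdams pole-shifting idea commonly used for this kind of SA: LPC analysis, warping of each complex pole's angle by a McAdams coefficient alpha, and re-synthesis from the residual. Function name, frame handling, and the default alpha are illustrative assumptions, not the authors' exact MC-SA pipeline.

    import numpy as np
    import librosa
    import scipy.signal

    def mcadams_anonymize_frame(frame, order=20, alpha=0.8):
        """Shift LPC pole angles by a McAdams coefficient alpha (illustrative sketch)."""
        a = librosa.lpc(frame, order=order)                 # LPC coefficients (1, a1, ..., ap)
        residual = scipy.signal.lfilter(a, [1.0], frame)    # excitation via inverse filtering
        poles = np.roots(a)
        # Warp the angle of each complex pole: theta -> theta ** alpha
        new_poles = np.array([
            np.abs(p) * np.exp(1j * np.sign(np.angle(p)) * (np.abs(np.angle(p)) ** alpha))
            if np.iscomplex(p) else p
            for p in poles
        ])
        a_new = np.real(np.poly(new_poles))                 # rebuild coefficients from shifted poles
        return scipy.signal.lfilter([1.0], a_new, residual) # re-synthesize the anonymized frame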

#5 Automatic Data Augmentation Selection and Parametrization in Contrastive Self-Supervised Speech Representation Learning [PDF]

Authors: Salah Zaiem ; Titouan Parcollet ; Slim Essid

Contrastive learning enables learning useful audio and speech representations without ground-truth labels by maximizing the similarity between latent representations of similar signal segments. In this framework, various data augmentation techniques are usually exploited to help enforce desired invariances within the learned representations, improving performance on various audio tasks thanks to more robust embeddings. Selecting the most relevant augmentations has proven crucial for better downstream performance. Thus, this work introduces a conditional independence-based method which automatically selects a suitable distribution over the choice of augmentations and their parametrization from a predefined set, for contrastive self-supervised pre-training. This is performed with respect to a downstream task of interest, hence saving a costly hyper-parameter search. Experiments performed on two different downstream tasks validate the proposed approach, showing better results than training without augmentation or with baseline augmentations. We furthermore conduct a qualitative analysis of the automatically selected augmentations and their variation according to the final downstream dataset considered.

#6 Transformer-based quality assessment model for generalized user-generated multimedia audio content [PDF]

Authors: Deebha Mumtaz ; Ajit Jena ; Vinit Jakhetiya ; Karan Nathwani ; Sharath Chandra Guntuku

In this paper, we propose a computational measure for the quality of audio in user-generated multimedia (UGM) in accordance with the human perceptual system. To this end, we first extend the previously proposed IIT-JMU-UGM Audio dataset by including samples with more diverse context, content, distortion types, and intensities, along with implicitly distorted audio that reflects realistic scenarios. We conduct subjective testing on the extended database containing 2075 audio clips to obtain the mean opinion scores for each sample. We then introduce transformer-based learning to the domain of audio quality assessment, training the model on three vital audio features: Mel-frequency cepstral coefficients, chroma, and the Mel-scaled spectrogram. The proposed non-intrusive transformer-based model is compared against state-of-the-art methods and found to outperform Simple RNN, LSTM, and GRU models by over 4%. The database and the source code will be made public upon acceptance.
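
For reference, the three input features named in the abstract can be extracted as follows; sampling rate and coefficient counts below are assumptions, not values stated by the authors.

    import librosa

    def extract_features(path, sr=16000, n_mfcc=20):
        """Extract MFCC, chroma, and Mel-spectrogram features (parameter choices are assumptions)."""
        y, sr = librosa.load(path, sr=sr)
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)                    # (n_mfcc, frames)
        chroma = librosa.feature.chroma_stft(y=y, sr=sr)                          # (12, frames)
        mel = librosa.power_to_db(librosa.feature.melspectrogram(y=y, sr=sr))     # (n_mels, frames)
        return mfcc, chroma, mel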

#7 Adversarial-Free Speaker Identity-Invariant Representation Learning for Automatic Dysarthric Speech Classification [PDF]

Authors: Parvaneh Janbakhshi ; Ina Kodrasi

Speech representations which are robust to pathology-unrelated cues such as speaker identity information have been shown to be advantageous for automatic dysarthric speech classification. A recently proposed technique to learn speaker identity-invariant representations for dysarthric speech classification is based on adversarial training. However, adversarial training can be challenging, unstable, and sensitive to training parameters. To avoid adversarial training, in this paper we propose to learn speaker-identity invariant representations exploiting a feature separation framework relying on mutual information minimization. Experimental results on a database of neurotypical and dysarthric speech show that the proposed adversarial-free framework successfully learns speaker identity-invariant representations. Further, it is shown that such representations result in a similar dysarthric speech classification performance as the representations obtained using adversarial training, while the training procedure is more stable and less sensitive to training parameters.

#8 Automated Detection of Wilson’s Disease Based on Improved Mel-frequency Cepstral Coefficients with Signal Decomposition [PDF]

Authors: Zhenglin Zhang ; Li-Zhuang Yang ; Xun Wang ; Hai Li

Wilson's disease (WD), a rare genetic movement disorder, is characterized by early-onset dysarthria. Automated speech assessment is thus valuable for early diagnosis and intervention. Time-frequency features, such as Mel-frequency cepstral coefficients (MFCC), have been frequently used. However, human speech signals are nonlinear and nonstationary, which cannot be captured by traditional features based on the Fourier transform. Moreover, the dysarthria of WD patients is complex and differs from that of other movement disorders such as Parkinson's disease. Thus, sensitive time-frequency measures for WD patients are needed. The present study proposes DMFCC, an improved MFCC based on signal decomposition. We validate the usefulness of DMFCC for WD detection with a sample of 60 WD patients and 60 matched healthy controls. Results show that DMFCC achieves the best classification accuracy (86.1%), improving by 13.9%-44.4% over baseline features such as MFCC and the state-of-the-art Hilbert cepstral coefficients (HCCs). The present study is a first attempt to demonstrate the validity of automated acoustic measures in WD detection, and the proposed DMFCC provides a novel tool for speech assessment.
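
The abstract does not specify which decomposition underlies DMFCC, so the sketch below only illustrates the general recipe: decompose the signal into components, compute MFCCs per component, and stack them. The band-pass filterbank used here is a placeholder assumption; the authors' decomposition (e.g., a mode decomposition) may differ.

    import numpy as np
    import librosa
    from scipy.signal import butter, sosfiltfilt

    def dmfcc_like(y, sr=16000, bands=((50, 500), (500, 2000), (2000, 7000)), n_mfcc=13):
        """Illustrative decomposition-based MFCC: band-limit the signal, then compute
        MFCCs per component and stack them (band edges are assumptions)."""
        feats = []
        for lo, hi in bands:
            sos = butter(4, [lo, hi], btype="bandpass", fs=sr, output="sos")
            component = sosfiltfilt(sos, y)
            feats.append(librosa.feature.mfcc(y=component, sr=sr, n_mfcc=n_mfcc))
        return np.vstack(feats)   # (len(bands) * n_mfcc, frames)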

#9 The effect of backward noise on lexical tone discrimination in Mandarin-speaking amusics [PDF]

Authors: Zixia Fan ; Jing Shao ; Weigong Pan ; Min Xu ; Lan Wang

Congenital amusia is a neurogenetic disorder affecting musical pitch processing. It also transfers to the language domain and negatively influences the perception of linguistic components that rely on pitch, such as lexical tones. It has been well established that unfavorable listening conditions impact lexical tone perception in amusics. For instance, both Mandarin- and Cantonese-speaking amusics were impaired in tone processing under simultaneous noise. Backward noise is one of the adverse listening conditions, but its interference mechanism is distinct from that of simultaneous noise. It therefore warrants further study to explore whether and how backward masking noise affects tone processing in amusics. In the current study, 18 Mandarin-speaking amusics and 18 controls were tested on discrimination of Mandarin tones under two conditions: a quiet condition involving relatively low-level processing and a backward masking condition involving high-level processing (e.g., tone categorization), in which a native multi-talker babble noise was added to the target tones. The results revealed that amusics performed similarly to controls in the quiet condition but more poorly in the backward noise condition. These findings shed light on how adverse listening environments influence amusics' lexical tone processing and provide further empirical evidence that amusics may be impaired in the high-level phonological processing of lexical tones.

#10 Automatic Selection of Discriminative Features for Dementia Detection in Cantonese-Speaking People [PDF]

Authors: Xiaoquan KE ; Man-Wai Mak ; Helen M. Meng

Dementia is a severe cognitive impairment that affects the health of older adults and creates a burden for their families and caretakers. This paper analyzes diverse features extracted from spoken language and selects the most discriminative features for dementia detection. The paper presents a deep learning-based feature ranking method called dual-net feature ranking (DFR). The proposed DFR utilizes a dual-net architecture, in which two networks (called the operator and the selector) are alternately and cooperatively trained to perform feature selection and dementia detection simultaneously. The DFR interprets the contribution of individual features to the predictions of the selector network using all of the selector's parameters. The DFR was evaluated on the Cantonese JCCOCC-MoCA Elderly Speech Dataset. Results show that the DFR can significantly reduce feature dimensionality while identifying small feature subsets with performance comparable or superior to that of the whole feature set. The selected features have been uploaded to https://github.com/kexquan/AD-detection-Feature-selection.

#11 Automated Voice Pathology Discrimination from Continuous Speech Benefits from Analysis by Phonetic Context [PDF]

Authors: Zhuoya Liu ; Mark Huckvale ; Julian McGlashan

In contrast to previous studies that look only at discriminating pathological voice from the normal voice, in this study we focus on the discrimination between cases of spasmodic dysphonia (SD) and vocal fold palsy (VP) using automated analysis of speech recordings. The hypothesis is that discrimination will be enhanced by studying continuous speech, since the different pathologies are likely to have different effects in different phonetic contexts. We collected audio recordings of isolated vowels and of a read passage from 60 patients diagnosed with SD (N=38) or VP (N=22). Baseline classifiers on features extracted from the recordings taken as a whole gave a cross-validated unweighted average recall of up to 75% for discriminating the two pathologies. We used an automated method to divide the read passage into phone-labelled regions and built classifiers for each phone. Results show that the discriminability of the pathologies varied with phonetic context as predicted. Since different phone contexts provide different information about the pathologies, classification is improved by fusing phone predictions, to achieve a classification accuracy of 83%. The work has implications for the differential diagnosis of voice pathologies and contributes to a better understanding of their impact on speech.

#12 Multi-Type Outer Product-Based Fusion of Respiratory Sounds for Detecting COVID-19 [PDF]

Authors: Adria Mallol-Ragolta ; Helena Cuesta ; Emilia Gomez ; Björn Schuller

This work presents an outer product-based approach to fuse the embedded representations learnt from the spectrograms of cough, breath, and speech samples for the automatic detection of COVID-19. To extract deep learnt representations from the spectrograms, we compare the performance of specific Convolutional Neural Networks (CNNs) trained from scratch and ResNet18-based CNNs fine-tuned for the task at hand. Furthermore, we investigate whether the patients' sex and the use of contextual attention mechanisms are beneficial. Our experiments use the dataset released as part of the Second Diagnosing COVID-19 using Acoustics (DiCOVA) Challenge. The results suggest the suitability of fusing breath and speech information to detect COVID-19. An Area Under the Curve (AUC) of 84.06 % is obtained on the test partition when using specific CNNs trained from scratch with contextual attention mechanisms. When using ResNet18-based CNNs for feature extraction, the baseline model scores the highest performance with an AUC of 84.26 %.
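
The core fusion step described here is easy to express directly: take the outer product of two modality embeddings, flatten it, and classify. A minimal PyTorch sketch follows; the embedding dimensions, class count, and module name are assumptions for illustration.

    import torch
    import torch.nn as nn

    class OuterProductFusion(nn.Module):
        """Fuse two modality embeddings via their outer product (dimensions are assumptions)."""
        def __init__(self, dim_a=128, dim_b=128, n_classes=2):
            super().__init__()
            self.classifier = nn.Linear(dim_a * dim_b, n_classes)

        def forward(self, emb_breath, emb_speech):
            # (batch, dim_a) x (batch, dim_b) -> (batch, dim_a, dim_b)
            fused = torch.einsum("bi,bj->bij", emb_breath, emb_speech)
            return self.classifier(fused.flatten(start_dim=1))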

#13 Robust Cough Feature Extraction and Classification Method for COVID-19 Cough Detection Based on Vocalization Characteristics [PDF]

Authors: Xueshuai Zhang ; Jiakun Shen ; Jun Zhou ; Pengyuan Zhang ; Yonghong Yan ; Zhihua Huang ; Yanfen Tang ; Yu Wang ; Fujie Zhang ; Shaoxing Zhang ; Aijun Sun

A fast, efficient and accurate detection method of COVID-19 remains a critical challenge. Many cough-based COVID-19 detection researches have shown competitive results through artificial intelligence. However, the lack of analysis on vocalization characteristics of cough sounds limits the further improvement of detection performance. In this paper, we propose two novel acoustic features of cough sounds and a convolutional neural network structure for COVID-19 detection. First, a time-frequency differential feature is proposed to characterize dynamic information of cough sounds in time and frequency domain. Then, an energy ratio feature is proposed to calculate the energy difference caused by the phonation characteristics in different cough phases. Finally, a convolutional neural network with two parallel branches which is pre-trained on a large amount of unlabeled cough data is proposed for classification. Experiment results show that our proposed method achieves state-of-the-art performance on Coswara dataset for COVID-19 detection. The results on an external clinical dataset Virufy also show the better generalization ability of our proposed method.
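
A rough sketch of the two feature ideas, under stated assumptions: the "differential" features are approximated as frame-to-frame and bin-to-bin differences of a log-Mel spectrogram, and the energy ratio uses a placeholder 50/50 split of the cough instead of the authors' phase segmentation.

    import numpy as np
    import librosa

    def tf_differential_and_energy_ratio(y, sr=16000):
        """Sketch of time/frequency differences of a log-Mel spectrogram and an
        energy ratio between early and late cough phases (fixed split is a placeholder)."""
        logmel = librosa.power_to_db(librosa.feature.melspectrogram(y=y, sr=sr))
        diff_time = np.diff(logmel, axis=1)    # frame-to-frame (time) differences
        diff_freq = np.diff(logmel, axis=0)    # bin-to-bin (frequency) differences
        mid = len(y) // 2
        energy_ratio = (np.sum(y[:mid] ** 2) + 1e-8) / (np.sum(y[mid:] ** 2) + 1e-8)
        return diff_time, diff_freq, energy_ratio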

#14 Comparing 1-dimensional and 2-dimensional spectral feature representations in voice pathology detection using machine learning and deep learning classifiers [PDF]

Authors: Farhad Javanmardi ; Sudarsana Reddy Kadiri ; Manila Kodali ; Paavo Alku

The present study investigates the use of 1-dimensional (1-D) and 2-dimensional (2-D) spectral feature representations in voice pathology detection with several classical machine learning (ML) and recent deep learning (DL) classifiers. Four popularly used spectral feature representations (static mel-frequency cepstral coefficients (MFCCs), dynamic MFCCs, spectrogram and mel-spectrogram) are derived in both the 1-D and 2-D form from voice signals. Three widely used ML classifiers (support vector machine (SVM), random forest (RF) and Adaboost) and three DL classifiers (deep neural network (DNN), long short-term memory (LSTM) network, and convolutional neural network (CNN)) are used with the 1-D feature representations. In addition, CNN classifiers are built using the 2-D feature representations. The popularly used HUPA database is considered in the pathology detection experiments. Experimental results revealed that using the CNN classifier with the 2-D feature representations yielded better accuracy compared to using the ML and DL classifiers with the 1-D feature representations. The best performance was achieved using the 2-D CNN classifier based on dynamic MFCCs that showed a detection accuracy of 81%.
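
To make the 1-D versus 2-D distinction concrete, a common way to derive both forms from the same feature is shown below; the choice of time-averaging for the 1-D form is an assumption, since the abstract does not state how the 1-D representations were obtained.

    import librosa

    def mfcc_1d_and_2d(y, sr=16000, n_mfcc=13):
        """1-D vs 2-D spectral representations (time-averaging for the 1-D form is an assumption)."""
        mfcc_2d = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # (n_mfcc, frames) -> CNN input
        mfcc_1d = mfcc_2d.mean(axis=1)                              # (n_mfcc,) -> SVM/RF/DNN input
        return mfcc_1d, mfcc_2d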

#15 Zero-Shot Cross-lingual Aphasia Detection using Automatic Speech Recognition [PDF]

Authors: Gerasimos Chatzoudis ; Manos Plitsis ; Spyridoula Stamouli ; Athanasia–Lida Dimou ; Nassos Katsamanis ; Vassilis Katsouros

Aphasia is a common speech and language disorder, typically caused by a brain injury or a stroke, that affects millions of people worldwide. Detecting and assessing Aphasia in patients is a difficult, time-consuming process, and numerous attempts to automate it have been made, the most successful using machine learning models trained on aphasic speech data. Like in many medical applications, aphasic speech data is scarce and the problem is exacerbated in so-called "low-resource" languages, which are, for this task, most languages excluding English. We attempt to leverage available data in English and achieve zero-shot aphasia detection in low-resource languages such as Greek and French, by using language-agnostic linguistic features. Current cross-lingual aphasia detection approaches rely on manually extracted transcripts. We propose an end-to-end pipeline using pre-trained Automatic Speech Recognition (ASR) models that share cross-lingual speech representations and are fine-tuned for our desired low-resource languages. To further boost our ASR model's performance, we also combine it with a language model. We show that our ASR-based end-to-end pipeline offers comparable results to previous setups using human-annotated transcripts.

#16 Domain-aware Intermediate Pretraining for Dementia Detection with Limited Data [PDF]

Authors: Youxiang Zhu ; Xiaohui Liang ; John A. Batsis ; Robert M. Roth

Detecting dementia from human speech is promising but faces a limited-data challenge. While recent research has shown that general pretrained models (e.g., BERT) can be applied to improve dementia detection, such a pretrained model can hardly be fine-tuned on the small available dementia dataset without overfitting. In this paper, we propose domain-aware intermediate pretraining, which enables a pretraining process on a domain-similar dataset selected by incorporating knowledge from the dementia dataset. Specifically, we use pseudo-perplexity to find an effective pretraining dataset, and then propose dataset-level and sample-level domain-aware intermediate pretraining techniques. We further employ information units (IU) from previous dementia research and define an IU-pseudo-perplexity to reduce the computational complexity. We confirm the effectiveness of perplexity by showing a strong correlation between perplexity and accuracy using nine datasets and models from the GLUE benchmark. We show that our domain-aware intermediate pretraining improves detection accuracy in almost all cases. Our results suggest that the difference in text-based perplexity between patients with Alzheimer's dementia and healthy controls is still small, and that a perplexity incorporating acoustic features (e.g., pauses) may make the pretraining more effective.
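
Pseudo-perplexity for a masked language model is typically computed by masking each token in turn and averaging the negative log-likelihood of the true token. A minimal sketch with HuggingFace Transformers follows; the BERT checkpoint named here is an example, not necessarily the model used in the paper.

    import torch
    from transformers import AutoTokenizer, AutoModelForMaskedLM

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")   # example checkpoint
    model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased").eval()

    def pseudo_perplexity(text):
        """Mask each token in turn and average the negative log-likelihood of the true token."""
        ids = tokenizer(text, return_tensors="pt")["input_ids"][0]
        nlls = []
        for i in range(1, len(ids) - 1):               # skip [CLS] and [SEP]
            masked = ids.clone()
            masked[i] = tokenizer.mask_token_id
            with torch.no_grad():
                logits = model(masked.unsqueeze(0)).logits[0, i]
            nlls.append(-torch.log_softmax(logits, dim=-1)[ids[i]])
        return torch.exp(torch.stack(nlls).mean()).item()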

#17 Comparison of 5 methods for the evaluation of intelligibility in mild to moderate French dysarthric speech [PDF]

Authors: Cécile Fougeron ; Nicolas Audibert ; Ina Kodrasi ; Parvaneh Janbakhshi ; Michaela Pernon ; Nathalie Leveque ; Stephanie Borel ; Marina Laganaro ; Herve Bourlard ; Frederic Assal

Altered quality of the phonetic-acoustic information in the speech signal in the case of motor speech disorders may reduce its intelligibility. Monitoring intelligibility is part of the standard clinical assessment of patients. It is also a valuable tool to index the evolution of the speech disorder. However, measuring intelligibility raises methodological debates concerning: the type of linguistic material on which the assessment is based (non-words, words, continuous speech), the evaluation protocol and type of scores (scale-based rating, transcription or recognition tests), and the advantages and disadvantages of listener vs. automatic-based approaches (subjective vs. objective, expertise level, types of models used). In this paper, the intelligibility of the speech of 32 French patients presenting mild to moderate dysarthria and 17 elderly speakers is assessed with five different methods: impressionistic clinician judgment on continuous speech, number of words recognized in an interactive face-to-face setting and in an on-line testing of the same material by 75 judges, automatic feature-based and automatic speech recognition-based methods (both on short sentences). The implications of the different methods for clinical practice are discussed.

#18 A comparative study on vowel articulation in Parkinson's disease and multiple system atrophy [PDF]

Authors: Khalid Daoudi ; Biswajit Das ; Solange Milhé de Saint Victor ; Alexandra Foubert-Samier ; Margherita Fabbri ; Anne Pavy-Le Traon ; Olivier Rascol ; Virginie Woisard ; Wassilios G. Meissner

Acoustic realisation of the working vowel space has been widely studied in Parkinson's disease (PD). However, it has never been studied in atypical parkinsonian disorders (APD). The latter are neurodegenerative diseases which share similar clinical features with PD, rendering the differential diagnosis very challenging in early disease stages. This paper presents the first contribution in vowel space analysis in APD, by comparing corner vowel realisation in PD and the parkinsonian variant of Multiple System Atrophy (MSA-P). Our study has the particularity of focusing exclusively on early stage PD and MSA-P patients, as our main purpose was early differential diagnosis between these two diseases. We analysed the corner vowels, extracted from a spoken sentence, using traditional vowel space metrics. We found no statistical difference between the PD group and healthy controls (HC) while MSA-P exhibited significant differences with the PD and HC groups. We also found that some metrics conveyed complementary discriminative information. Consequently, we argue that restriction in the acoustic realisation of corner vowels cannot be a viable early marker of PD, as hypothesised by some studies, but it might be a candidate as an early hypokinetic marker of MSA-P (when the clinical target is discrimination between PD and MSA-P).
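
One of the traditional vowel space metrics referred to here is the vowel space area, the polygon area spanned by corner-vowel formants in the F1-F2 plane. A small sketch using the shoelace formula is given below; the formant values in the example are placeholders, not measurements from this study.

    import numpy as np

    def vowel_space_area(formants, order=("a", "i", "u")):
        """Shoelace area of the corner vowels in the F1-F2 plane.
        `formants` maps vowel -> (F1, F2) in Hz."""
        pts = np.array([formants[v] for v in order])
        x, y = pts[:, 0], pts[:, 1]
        return 0.5 * abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))

    # Placeholder formant values, for illustration only:
    print(vowel_space_area({"a": (750, 1300), "i": (300, 2300), "u": (320, 800)}))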

#19 Voicing decision based on phonemes classification and spectral moments for whisper-to-speech conversion [PDF]

Authors: Luc Ardaillon ; Nathalie Henrich ; Olivier Perrotin

Cordectomized or laryngectomized patients recover the ability to speak thanks to devices able to produce a natural-sounding voice source in real time. However, constant voicing can impair the naturalness and intelligibility of reconstructed speech. Voicing decision, i.e., identifying whether an uttered phone should be voiced or not, is investigated here as an automatic process in the context of whisper-to-speech (W2S) conversion systems. Whereas state-of-the-art approaches apply DNN techniques to high-dimensional acoustic features, we seek a low-resource alternative that provides a perceptually meaningful mapping between acoustic features and the voicing decision and is suitable for real-time applications. Our method first classifies whisper signal frames into phoneme classes based on their spectral centroid and spread, and then discriminates voiced phonemes from their unvoiced counterparts based on class-dependent spectral centroid thresholds. We compared our method to a simpler approach using a single centroid threshold on several databases of annotated whispers in both single-speaker and multi-speaker training setups. While both approaches reach voicing accuracy above 91%, the proposed method avoids some systematic voicing decision errors, which may allow users to learn to adapt their speech in real time to compensate for the remaining voicing errors.
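
The two spectral moments used for classification are straightforward to compute per frame, and the final decision is a class-dependent threshold on the centroid. A minimal sketch follows; the threshold values, class labels, and the direction of the comparison are placeholders, not the authors' calibrated settings.

    import numpy as np

    def centroid_and_spread(frame, sr=16000):
        """Spectral centroid and spread of one windowed frame."""
        mag = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
        freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
        p = mag / (mag.sum() + 1e-12)
        centroid = np.sum(freqs * p)
        spread = np.sqrt(np.sum(((freqs - centroid) ** 2) * p))
        return centroid, spread

    def voicing_decision(centroid, phoneme_class, thresholds):
        """Class-dependent centroid thresholding; threshold values are placeholders."""
        return centroid < thresholds[phoneme_class]   # e.g. low centroid -> voiced counterpart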

#20 Speech Acoustics in Mild Cognitive Impairment and Parkinson's Disease With and Without Concurrent Drawing Tasks [PDF]

Authors: Tanya Talkar ; Christina Manxhari ; James Williamson ; Kara M. Smith ; Thomas Quatieri

Parkinson's disease (PD) is characterized by motor dysfunction; however, non-motor symptoms such as cognitive decline also have a dramatic impact on quality of life. Current assessments to diagnose cognitive impairment take many hours and require high clinician involvement. Thus, there is a need to develop new tools leading to quick and accurate determination of cognitive impairment to allow for appropriate, timely interventions. In this paper, individuals with PD, designated as either having no cognitive impairment (NCI) or mild cognitive impairment (MCI), undergo a speech-based protocol, involving reading or listing items within a category, performed either with or without a concurrent drawing task. From the speech recordings, we extract motor coordination-based features, derived from correlations across acoustic features representative of speech production subsystems. The correlation-based features are utilized in Gaussian mixture models to discriminate between individuals designated NCI or MCI in both the single and dual task paradigms. Features derived from the laryngeal and respiratory subsystems, in particular, discriminate between these two groups with AUCs > 0.80. These results suggest that cognitive impairment can be detected using speech from both single and dual task paradigms, and that cognitive impairment may manifest as differences in vocal fold vibration stability.
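
Correlation-structure ("coordination") features of this kind are often built from a channel-delay correlation matrix of low-level acoustic trajectories, summarized by its eigenvalue spectrum and scored with per-group Gaussian mixture models. The sketch below is a loose reconstruction under assumptions: the delays, the feature trajectories, and the GMM setup (trained here on random placeholder data) are illustrative, not the authors' configuration.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def coordination_features(feats, delays=(1, 3, 7)):
        """Eigenvalue spectrum of a channel-delay correlation matrix.
        `feats` is (channels, frames), e.g. formant or MFCC trajectories; delays are assumptions."""
        max_d = max(delays)
        stacked = [feats[:, :feats.shape[1] - max_d]]
        for d in delays:
            stacked.append(feats[:, d:feats.shape[1] - max_d + d])
        corr = np.corrcoef(np.vstack(stacked))
        return np.sort(np.linalg.eigvalsh(corr))[::-1]

    # One GMM per group, scored by log-likelihood ratio (placeholder random data, dim 8):
    gmm_nci = GaussianMixture(n_components=2).fit(np.random.randn(50, 8))
    gmm_mci = GaussianMixture(n_components=2).fit(np.random.randn(50, 8))
    test = np.random.randn(1, 8)
    score = gmm_mci.score_samples(test) - gmm_nci.score_samples(test)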

#21 Investigating the Impact of Speech Compression on the Acoustics of Dysarthric Speech [PDF]

Authors: Kelvin Tran ; Lingfeng Xu ; Gabriela Stegmann ; Julie Liss ; Visar Berisha ; Rene Utianski

Acoustic analysis plays an important role in the assessment of dysarthria. Out of a public health necessity, telepractice has become increasingly adopted as the modality in which clinical care is given. While there are differences in software among telepractice platforms, they all use some form of speech compression to preserve bandwidth, with the most common algorithm being the Opus codec. Opus has been optimized for compression of speech from the general (mostly healthy) population. As a result, for speech-language pathologists, this begs the question: is the remotely transmitted speech signal a faithful representation of dysarthric speech? Existing high-fidelity audio recordings from 20 speakers of various dysarthria types were encoded at three different bit rates defined within Opus to simulate different internet bandwidth conditions. Acoustic measures of articulation, voice, and prosody were extracted, and mixed-effect models were used to evaluate the impact of bandwidth conditions on the measures. Significant differences in cepstral peak prominence, degree of voice breaks, jitter, vowel space area, and pitch were observed after Opus processing, providing insight into the types of acoustic measures that are susceptible to speech compression algorithms.
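
Simulating such bandwidth conditions can be done by round-tripping recordings through Opus, e.g. with ffmpeg's libopus encoder. The bit rates and file names below are examples for illustration, not necessarily those used in the study.

    import subprocess

    def opus_roundtrip(wav_in, wav_out, bitrate="16k"):
        """Encode to Opus at a given bit rate and decode back to WAV using ffmpeg."""
        subprocess.run(["ffmpeg", "-y", "-i", wav_in, "-c:a", "libopus",
                        "-b:a", bitrate, "tmp.opus"], check=True)
        subprocess.run(["ffmpeg", "-y", "-i", "tmp.opus", wav_out], check=True)

    for br in ["6k", "16k", "64k"]:          # example bandwidth conditions
        opus_roundtrip("speaker01.wav", f"speaker01_{br}.wav", bitrate=br)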

#22 Speaker Trait Enhancement for Cochlear Implant Users: A Case Study for Speaker Emotion Perception [PDF]

Authors: Avamarie Brueggeman ; John H.L. Hansen

Despite significant progress in areas such as speech recognition, cochlear implant users still experience challenges related to identifying various speaker traits such as gender, age, emotion, accent, etc. In this study, we focus on emotion as one trait. We propose the use of emotion intensity conversion to perceptually enhance emotional speech with the goal of improving speech emotion recognition for cochlear implant users. To this end, we utilize a parallel speech dataset containing emotion and intensity labels to perform conversion from normal to high intensity emotional speech. A non-negative matrix factorization method is integrated to perform emotion intensity conversion via spectral mapping. We evaluate our emotional speech enhancement using a support vector machine model for emotion recognition. In addition, we perform an emotional speech recognition listener experiment with normal hearing listeners using vocoded audio. It is suggested that such enhancement will benefit speaker trait perception for cochlear implant users.
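
NMF-based spectral mapping with parallel data is often done by learning activations against a source dictionary and reconstructing with the paired target dictionary. The sketch below shows that idea with a simple multiplicative-update solver; using the parallel training spectrograms directly as dictionaries is an assumption, not necessarily the authors' exact conversion scheme.

    import numpy as np

    def nmf_activations(V, W, n_iter=200, eps=1e-9):
        """Multiplicative updates for H with a fixed nonnegative dictionary W (V ≈ W @ H)."""
        H = np.random.rand(W.shape[1], V.shape[1])
        for _ in range(n_iter):
            H *= (W.T @ V) / (W.T @ (W @ H) + eps)
        return H

    # Parallel dictionaries: columns are paired normal / high-intensity spectral frames (assumption).
    # W_normal, W_high: (freq_bins, n_exemplars); V_test: (freq_bins, frames) normal-intensity input.
    # H = nmf_activations(V_test, W_normal)
    # V_converted = W_high @ H            # spectral mapping toward high-intensity emotion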

#23 Optimal thyroplasty implant shape and stiffness for treatment of acute unilateral vocal fold paralysis: Evidence from a canine in vivo phonation model [PDF]

Authors: Neha Reddy ; Yoonjeong Lee ; Zhaoyan Zhang ; Dinesh K. Chhetri

Medialization thyroplasty is a frequently used surgical treatment for insufficient glottal closure and involves placement of an implant to medialize the vocal fold. Prior studies have been unable to determine optimal implant shape and stiffness. In this study, thyroplasty implants with various medial surface shapes (rectangular, convergent, or divergent) and stiffnesses (Silastic, Gore-Tex, soft silicone of varying stiffness, or hydrogel) were assessed for optimal voice quality in an in vivo canine model of unilateral vocal fold paralysis with graded contralateral neuromuscular stimulation to mimic expected compensation seen in patients with this laryngeal pathology. Across experiments, Silastic rectangular implants consistently result in an improved voice quality metric, indicating high-quality output phonation. These findings have clinical implications for the optimization of thyroplasty implant treatment for speakers with laryngeal pathologies causing glottic insufficiency.

#24 Domain Generalization with Relaxed Instance Frequency-wise Normalization for Multi-device Acoustic Scene Classification [PDF]

Authors: Byeonggeun Kim ; Seunghan Yang ; Jangho Kim ; Hyunsin Park ; Juntae Lee ; Simyung Chang

While using two-dimensional convolutional neural networks (2D-CNNs) in image processing, it is possible to manipulate domain information using channel statistics, and instance normalization has been a promising way to obtain domain-invariant features. Unlike in image processing, our analysis shows that domain-relevant information in an audio feature is dominant in frequency statistics rather than channel statistics. Motivated by this analysis, we introduce Relaxed Instance Frequency-wise Normalization (RFN): a plug-and-play, explicit normalization module along the frequency axis which can eliminate instance-specific domain discrepancy in an audio feature while relaxing the undesirable loss of useful discriminative information. Empirically, simply adding RFN to networks shows clear margins over previous domain generalization approaches on acoustic scene classification and yields improved robustness across multiple audio devices. Notably, the proposed RFN won DCASE2021 Challenge Task 1A, low-complexity acoustic scene classification with multiple devices, by a clear margin; this paper is an extended version of that work.
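
The abstract does not give the relaxation formula, so the module below is only one plausible reading: instance statistics computed per frequency bin are used to normalize the feature, and the result is blended with the unnormalized input by a relaxation weight. The blending form, the statistic pooling axes, and the default weight are assumptions.

    import torch
    import torch.nn as nn

    class RelaxedFreqNorm(nn.Module):
        """Frequency-wise instance normalization, relaxed by blending with the input.
        The blending form is an assumption based on the abstract, not the paper's definition."""
        def __init__(self, lam=0.5, eps=1e-5):
            super().__init__()
            self.lam, self.eps = lam, eps

        def forward(self, x):                     # x: (batch, channels, freq, time)
            # statistics per (instance, frequency bin), pooled over channels and time
            mean = x.mean(dim=(1, 3), keepdim=True)
            var = x.var(dim=(1, 3), keepdim=True, unbiased=False)
            x_norm = (x - mean) / torch.sqrt(var + self.eps)
            return self.lam * x_norm + (1.0 - self.lam) * x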

#25 Couple learning for semi-supervised sound event detection [PDF]

Authors: Tao Rui ; Yan Long ; Ouchi Kazushige ; Xiangdong Wang

The recently proposed Mean Teacher method, which exploits large-scale unlabeled data in a self-ensembling manner, has achieved state-of-the-art results on several semi-supervised learning benchmarks. Spurred by these achievements, this paper proposes an effective Couple Learning method that combines a well-trained model with a Mean Teacher model. The proposed pseudo-label generation model (PLG) enlarges the pool of strongly and weakly labeled data to improve the Mean Teacher method's performance, while the Mean Teacher's consistency cost reduces the impact of noise in the pseudo-labels introduced by detection errors. Experimental results on Task 4 of the DCASE2020 challenge demonstrate the superiority of the proposed method, achieving an F1-score of about 44.25% on the validation set without post-processing, significantly outperforming the baseline system's 32.39%. Furthermore, this paper also proposes a simple and effective experiment, the Variable Order Input (VOI) experiment, which demonstrates the significance of the Couple Learning method. Our Couple Learning code is available on GitHub.
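
For context, the two Mean Teacher ingredients this method builds on are an exponential moving average (EMA) of student weights into the teacher and a consistency cost between student and teacher predictions on unlabeled clips. A minimal PyTorch sketch follows; the EMA decay and the sigmoid-MSE consistency form are common choices assumed here, and the PLG pseudo-labeling itself is not shown.

    import torch
    import torch.nn.functional as F

    def update_teacher(student, teacher, ema_decay=0.999):
        """Exponential moving average of student weights into the teacher (Mean Teacher)."""
        with torch.no_grad():
            for t_p, s_p in zip(teacher.parameters(), student.parameters()):
                t_p.mul_(ema_decay).add_(s_p, alpha=1.0 - ema_decay)

    def consistency_loss(student_logits, teacher_logits):
        """MSE between student and teacher posteriors on unlabeled clips (a common choice)."""
        return F.mse_loss(torch.sigmoid(student_logits), torch.sigmoid(teacher_logits).detach())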